1 Executive summary

We have chosen Copenhagen as our city. Based on our analysis of the data set, we conclude that the best predictors of Airbnb prices for a 4-night stay for 2 people in Copenhagen are:

  1. Type of property – the bigger and more private the property, the higher the price
  2. Number of reviews and review scores – these serve as a proxy for quality: the more reviews and the higher the ratings, the better the property is assumed to be
  3. Number of people the property accommodates – properties that accommodate more people tend to have higher prices
  4. Neighbourhood in which the property is located – more exclusive neighbourhoods and those close to the city centre have higher prices than others
  5. Availability of the property in the next 30 days – properties that are available within the next 30 days tend to cost more. We assume that guests booking at short notice are willing to pay more, whereas guests booking more than 30 days ahead are usually flexible and look for bargains
  6. Reviews per month – properties with more reviews per month tend to have lower prices, possibly because some negative reviews hurt a property's reputation

To arrive at the model that best predicts price, we first analysed the data set and selected the variables that should drive prices from a logical point of view; these formed our base model. We then examined the correlation of every other variable with price and added those that were correlated with price and improved the model. More variables than the 6 listed above were correlated with price, but most of them did not significantly affect it once included, most likely because they are themselves correlated with variables already in the model. We therefore omitted them.

In conclusion, the model that best predicts prices is log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + accommodates + neighbourhood_cleansed_simplified + availability_30 + reviews_per_month. It explains 52% of the variation in the (log) prices of Airbnb rentals for a 4-night stay for 2 people in Copenhagen.
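As a rough illustration of how a log-linear specification of this kind can be fitted and read, the sketch below runs `lm()` on a small synthetic data set. Everything in it is an assumption made for illustration: the data frame `toy`, the data-generating coefficients, and the reduced set of three predictors. It is not the model reported above and does not reproduce the 52% figure.

```r
# Synthetic stand-in for the cleaned listings data (all values are made up).
set.seed(42)
n <- 500
toy <- data.frame(
  accommodates    = sample(2:8, n, replace = TRUE),
  availability_30 = sample(0:30, n, replace = TRUE),
  room_type       = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE))
)

# Assumed data-generating process: log price rises with capacity and availability
# and is lower for private rooms.
toy$price_4_nights <- exp(
  7.5 + 0.10 * toy$accommodates + 0.01 * toy$availability_30 -
    0.40 * (toy$room_type == "Private room") + rnorm(n, sd = 0.3)
)

# Log-linear model: the outcome is log(price), the predictors enter linearly.
fit <- lm(log(price_4_nights) ~ accommodates + availability_30 + room_type, data = toy)

# Share of the variation in log(price) explained by the model.
r_squared <- summary(fit)$r.squared

# Percentage change in price for a one-unit increase in each predictor
# (intercept dropped because it has no such interpretation).
pct_effect <- 100 * (exp(coef(fit)[-1]) - 1)
print(round(pct_effect, 1))
```

Because the outcome is in logs, a coefficient b corresponds to a change of 100 * (exp(b) - 1) percent in price for a one-unit increase in the predictor, holding the other predictors fixed.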

2 Data Wrangling

2.1 Loading the data

Interpretation: Number of rows (observations): 9,625; number of columns (variables): 74; number of numeric variables: 37; number of character variables: 23; number of date variables: 5; number of logical variables (factor variables): 9.

Note: logical variables have a fixed, known set of possible values, so they can also be treated as factor variables, for example host_is_superhost (TRUE/FALSE) and host_has_profile_pic (TRUE/FALSE).

2.2 Examining the raw values of the data set

#variables / columns
dplyr::glimpse(listings)
Rows: 9,625
Columns: 74
$ id                                           <dbl> 6983, 26057, 29118, 31094…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.021093e+13, 2.021093e+1…
$ last_scraped                                 <date> 2021-09-30, 2021-09-30, …
$ name                                         <chr> "Copenhagen 'N Livin'", "…
$ description                                  <chr> "Lovely apartment located…
$ neighborhood_overview                        <chr> "Nice bars and cozy cafes…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 16774, 109777, 125230, 12…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Simon", "Kari", "Nana", …
$ host_since                                   <date> 2009-05-12, 2010-04-17, …
$ host_location                                <chr> "Copenhagen, Capital Regi…
$ host_about                                   <chr> "I'm currently working as…
$ host_response_time                           <chr> "N/A", "N/A", "within a f…
$ host_response_rate                           <chr> "N/A", "N/A", "100%", "N/…
$ host_acceptance_rate                         <chr> "N/A", "N/A", "50%", "0%"…
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FALS…
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Nørrebro", "Indre By", "…
$ host_listings_count                          <dbl> 1, 1, 1, 1, 3, 1, 0, 1, 2…
$ host_total_listings_count                    <dbl> 1, 1, 1, 1, 3, 1, 0, 1, 2…
$ host_verifications                           <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> FALSE, TRUE, TRUE, TRUE, …
$ neighbourhood                                <chr> "Copenhagen, Hovedstaden,…
$ neighbourhood_cleansed                       <chr> "Nrrebro", "Indre By", "V…
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude                                     <dbl> 55.68641, 55.69196, 55.67…
$ longitude                                    <dbl> 12.54741, 12.57637, 12.55…
$ property_type                                <chr> "Private room in rental u…
$ room_type                                    <chr> "Private room", "Entire h…
$ accommodates                                 <dbl> 2, 6, 2, 3, 5, 4, 4, 4, 1…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1 shared bath", "1.5 bat…
$ bedrooms                                     <dbl> 1, 4, 1, 1, 3, 2, 1, 2, N…
$ beds                                         <dbl> 1, 4, 1, 3, 4, 2, 1, 3, 1…
$ amenities                                    <chr> "[\"Cooking basics\", \"W…
$ price                                        <chr> "$370.00", "$2,400.00", "…
$ minimum_nights                               <dbl> 2, 4, 7, 2, 3, 100, 6, 5,…
$ maximum_nights                               <dbl> 15, 1125, 14, 10, 365, 11…
$ minimum_minimum_nights                       <dbl> 2, 4, 3, 2, 3, 100, 6, 5,…
$ maximum_minimum_nights                       <dbl> 2, 4, 5, 2, 3, 100, 6, 5,…
$ minimum_maximum_nights                       <dbl> 15, 1125, 14, 10, 365, 11…
$ maximum_maximum_nights                       <dbl> 15, 1125, 14, 10, 365, 11…
$ minimum_nights_avg_ntm                       <dbl> 2.0, 4.0, 4.1, 2.0, 3.0, …
$ maximum_nights_avg_ntm                       <dbl> 15, 1125, 14, 10, 365, 11…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 0, 17, 0, 0, 7, 0, 7, 23,…
$ availability_60                              <dbl> 0, 45, 0, 0, 10, 0, 23, 5…
$ availability_90                              <dbl> 0, 69, 15, 0, 10, 14, 36,…
$ availability_365                             <dbl> 0, 340, 101, 0, 12, 289, …
$ calendar_last_scraped                        <date> 2021-09-30, 2021-09-30, …
$ number_of_reviews                            <dbl> 168, 51, 22, 17, 75, 7, 7…
$ number_of_reviews_ltm                        <dbl> 0, 1, 0, 0, 2, 0, 0, 0, 0…
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review                                 <date> 2013-01-02, 2016-02-06, …
$ last_review                                  <date> 2018-11-23, 2019-12-14, …
$ review_scores_rating                         <dbl> 4.78, 4.90, 4.91, 4.87, 4…
$ review_scores_accuracy                       <dbl> 4.78, 4.91, 4.85, 4.80, 4…
$ review_scores_cleanliness                    <dbl> 4.78, 4.96, 4.77, 4.87, 4…
$ review_scores_checkin                        <dbl> 4.87, 4.91, 5.00, 4.85, 4…
$ review_scores_communication                  <dbl> 4.90, 4.83, 5.00, 4.80, 4…
$ review_scores_location                       <dbl> 4.72, 4.96, 4.85, 4.85, 4…
$ review_scores_value                          <dbl> 4.71, 4.80, 4.77, 4.46, 4…
$ license                                      <lgl> NA, NA, NA, NA, NA, NA, N…
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FALS…
$ calculated_host_listings_count               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ calculated_host_listings_count_entire_homes  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ calculated_host_listings_count_private_rooms <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 1.58, 0.74, 0.35, 0.26, 0…

2.3 Description of variables

You can find a full data dictionary here. Below are the definitions of some of the most important variables:

  • price: cost per night in Danish kroner (DKK)

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 5 in this data set)

  • longitude , latitude: geographical coordinates to help us locate the listing

  • neighbourhood: three variables (neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed) on a few major neighbourhoods in each city

2.4 Creating a clean data set

2.4.1 Converting character variables to numeric variables

## converting 'price' and the bathroom description to numeric variables
listings_clean <- listings %>% 
  mutate(
    ## 'price' is stored as text such as "$2,400.00"; parse_number() drops "$" and ","
    price = readr::parse_number(price),

    ## half-bath labels contain no digit, so recode them to "0.5" before parsing
    bathrooms_text = if_else(
      bathrooms_text %in% c("Shared half-bath", "Half-bath", "Private half-bath"),
      "0.5",
      bathrooms_text
    ),

    ## extracting the number of bathrooms from 'bathrooms_text' into numeric 'bathrooms'
    bathrooms = readr::parse_number(bathrooms_text)
  )
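To make the parsing step more concrete, here is a minimal base-R sketch of the same idea on a few hand-written strings. The vectors `price_raw` and `bath_raw` are illustrative assumptions modelled on the values shown in the glimpse output, and the regular expressions are a stand-in for `readr::parse_number()`, not its actual implementation.

```r
# Example raw values, modelled on the glimpse() output (illustrative only).
price_raw <- c("$370.00", "$2,400.00", "$1,050.00")
bath_raw  <- c("1 shared bath", "1.5 baths", "Half-bath", "Shared half-bath", NA)

# Price: strip everything that is not a digit or a decimal point, then convert.
price_num <- as.numeric(gsub("[^0-9.]", "", price_raw))

# Bathrooms: start with NA, fill in the leading number where one exists,
# and set every half-bath label to 0.5.
bath_num  <- rep(NA_real_, length(bath_raw))
has_digit <- grepl("^[0-9]", bath_raw)
bath_num[has_digit] <- as.numeric(sub("^([0-9.]+).*$", "\\1", bath_raw[has_digit]))
bath_num[grepl("half-bath", bath_raw, ignore.case = TRUE)] <- 0.5

print(price_num)  # 370 2400 1050
print(bath_num)   # 1.0 1.5 0.5 0.5 NA
```

Missing descriptions stay `NA`, which matches the 15 missing values of bathrooms_text reported by skim further below.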

2.4.2 Confirming price is a numeric variable

#check price is a number
typeof(listings_clean$price)
[1] "double"

3 Exploratory Data Analysis

3.1 Computing summary statistics and finding missing values

skimr::skim(listings)
Data summary
Name listings
Number of rows 9625
Number of columns 74
_______________________
Column type frequency:
character 23
Date 5
logical 9
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 9625 0
name 1 1.00 1 248 0 9357 0
description 329 0.97 2 1000 0 9179 0
neighborhood_overview 4242 0.56 4 1000 0 5163 0
picture_url 0 1.00 61 126 0 9525 0
host_url 0 1.00 39 43 0 8452 0
host_name 3 1.00 1 28 0 2808 0
host_location 19 1.00 2 119 0 397 0
host_about 4267 0.56 1 6639 0 4464 10
host_response_time 3 1.00 3 18 0 5 0
host_response_rate 3 1.00 2 4 0 62 0
host_acceptance_rate 3 1.00 2 4 0 97 0
host_thumbnail_url 3 1.00 55 106 0 8384 0
host_picture_url 3 1.00 57 109 0 8384 0
host_neighbourhood 4118 0.57 5 20 0 29 0
host_verifications 0 1.00 2 158 0 252 0
neighbourhood 4242 0.56 7 55 0 181 0
neighbourhood_cleansed 0 1.00 5 25 0 11 0
property_type 0 1.00 4 35 0 47 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 15 1.00 6 17 0 21 0
amenities 0 1.00 2 1612 0 9308 0
price 0 1.00 5 11 0 1387 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-30 2021-09-30 2021-09-30 1
host_since 3 1.00 2009-05-12 2021-09-27 2015-06-26 3040
calendar_last_scraped 0 1.00 2021-09-30 2021-09-30 2021-09-30 1
first_review 1382 0.86 2011-07-09 2021-09-29 2019-01-28 2121
last_review 1382 0.86 2011-07-21 2021-09-30 2019-12-23 1551

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 3 1 0.14 FAL: 8269, TRU: 1353
host_has_profile_pic 3 1 0.99 TRU: 9556, FAL: 66
host_identity_verified 3 1 0.79 TRU: 7607, FAL: 2015
neighbourhood_group_cleansed 9625 0 NaN :
bathrooms 9625 0 NaN :
calendar_updated 9625 0 NaN :
has_availability 0 1 0.98 TRU: 9432, FAL: 193
license 9625 0 NaN :
instant_bookable 0 1 0.21 FAL: 7643, TRU: 1982

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.720424e+07 16424268.66 6.983000e+03 1.307881e+07 2.730204e+07 4.201887e+07 5.251236e+07 ▇▆▅▆▇
scrape_id 0 1.00 2.021093e+13 0.00 2.021093e+13 2.021093e+13 2.021093e+13 2.021093e+13 2.021093e+13 ▁▁▇▁▁
host_id 0 1.00 8.435638e+07 102624593.48 1.677400e+04 1.050400e+07 3.627811e+07 1.324883e+08 4.248214e+08 ▇▂▁▁▁
host_listings_count 3 1.00 1.077000e+01 53.03 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.460000e+02 ▇▁▁▁▁
host_total_listings_count 3 1.00 1.077000e+01 53.03 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.460000e+02 ▇▁▁▁▁
latitude 0 1.00 5.568000e+01 0.02 5.562000e+01 5.567000e+01 5.568000e+01 5.569000e+01 5.573000e+01 ▁▃▇▆▁
longitude 0 1.00 1.256000e+01 0.03 1.245000e+01 1.254000e+01 1.256000e+01 1.258000e+01 1.264000e+01 ▁▂▇▆▂
accommodates 0 1.00 3.450000e+00 1.77 0.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.600000e+01 ▇▆▁▁▁
bedrooms 218 0.98 1.680000e+00 1.39 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.010000e+02 ▇▁▁▁▁
beds 63 0.99 2.060000e+00 1.51 0.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 2.500000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 4.590000e+00 20.84 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.111000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 5.637100e+02 536.17 1.000000e+00 2.000000e+01 3.650000e+02 1.125000e+03 4.000000e+03 ▇▇▁▁▁
minimum_minimum_nights 1 1.00 4.610000e+00 20.85 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.111000e+03 ▇▁▁▁▁
maximum_minimum_nights 1 1.00 5.120000e+00 25.63 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.400000e+03 ▇▁▁▁▁
minimum_maximum_nights 1 1.00 6.481200e+02 534.59 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 4.000000e+03 ▆▇▁▁▁
maximum_maximum_nights 1 1.00 6.578100e+02 532.65 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 4.000000e+03 ▆▇▁▁▁
minimum_nights_avg_ntm 1 1.00 4.770000e+00 21.01 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.111000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 1 1.00 6.543700e+02 532.30 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 4.000000e+03 ▆▇▁▁▁
availability_30 0 1.00 5.930000e+00 9.59 0.000000e+00 0.000000e+00 0.000000e+00 9.000000e+00 3.000000e+01 ▇▁▁▁▁
availability_60 0 1.00 1.374000e+01 20.60 0.000000e+00 0.000000e+00 0.000000e+00 2.500000e+01 6.000000e+01 ▇▁▁▁▂
availability_90 0 1.00 2.293000e+01 32.05 0.000000e+00 0.000000e+00 2.000000e+00 4.400000e+01 9.000000e+01 ▇▁▁▁▂
availability_365 0 1.00 1.007200e+02 125.83 0.000000e+00 0.000000e+00 3.100000e+01 1.790000e+02 3.650000e+02 ▇▂▁▁▂
number_of_reviews 0 1.00 1.984000e+01 35.88 0.000000e+00 2.000000e+00 8.000000e+00 2.300000e+01 6.600000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.200000e+00 5.12 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 1.680000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 4.000000e-01 1.12 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.700000e+01 ▇▁▁▁▁
review_scores_rating 1382 0.86 4.740000e+00 0.55 0.000000e+00 4.680000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 1457 0.85 4.830000e+00 0.29 1.000000e+00 4.780000e+00 4.920000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 1457 0.85 4.690000e+00 0.41 1.000000e+00 4.560000e+00 4.800000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 1457 0.85 4.880000e+00 0.27 1.000000e+00 4.860000e+00 4.970000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 1457 0.85 4.900000e+00 0.27 1.000000e+00 4.890000e+00 5.000000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 1458 0.85 4.820000e+00 0.27 1.000000e+00 4.750000e+00 4.890000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 1458 0.85 4.710000e+00 0.33 1.000000e+00 4.600000e+00 4.770000e+00 4.920000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 6.030000e+00 26.19 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.830000e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 5.710000e+00 26.22 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.830000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 2.900000e-01 0.94 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.10 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 ▇▁▁▁▁
reviews_per_month 1382 0.86 8.500000e-01 1.26 1.000000e-02 2.100000e-01 4.600000e-01 9.700000e-01 2.600000e+01 ▇▁▁▁▁


Interpretation:

We have 23 character variables. Many of them describe the property hosts, such as their name, location, response rate, picture and bio. Four character variables relate to the neighbourhood of the properties; three of them (neighborhood_overview, host_neighbourhood and neighbourhood) have more than 4,000 missing values, while neighbourhood_cleansed is complete. We can also observe that the host's “about” section is empty for more than 4,000 listings. Most importantly, ‘price’ is stored as a character variable instead of a numeric variable.

The logical variables take values from a fixed, known set, namely TRUE and FALSE. Four logical variables have 9,625 missing values, which means they are completely empty since the data set has 9,625 rows in total. These variables are neighbourhood_group_cleansed, bathrooms, calendar_updated and license.

Among the numeric variables, the review score variables have between 1,382 and 1,458 missing values each, which indicates that roughly 14-15% of the listings have no review scores. We also see that 218 properties do not list their number of bedrooms, which is likely an important variable for explaining the price of a property.
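The missing-value counts discussed above can also be obtained without skimr. The sketch below is a minimal base-R version on a made-up data frame (`toy` and its values are assumptions for illustration); the last two lines simply redo the percentage arithmetic with the counts reported in the skim output.

```r
# Tiny made-up data frame with the same kinds of gaps as the listings data.
toy <- data.frame(
  bedrooms             = c(1, NA, 2, 3),
  review_scores_rating = c(4.8, NA, NA, 5),
  license              = c(NA, NA, NA, NA)
)

# Number of missing values and completeness rate per column.
n_missing     <- colSums(is.na(toy))
complete_rate <- 1 - n_missing / nrow(toy)

# Columns that are completely empty (every row missing).
empty_cols <- names(n_missing)[n_missing == nrow(toy)]

# Share of listings without review scores, using the counts from the skim output.
share_no_rating    <- round(100 * 1382 / 9625, 1)  # review_scores_rating
share_no_subscores <- round(100 * 1458 / 9625, 1)  # e.g. review_scores_value
print(c(share_no_rating, share_no_subscores))      # 14.4 15.1
```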

3.2 Overview of variables of interest

Note: We are examining 5 variables of interest, namely: 1. price 2. number of beds 3. room capacity 4. minimum nights 5. maximum nights

For now, we have chosen these variables based on their potential ability to explain our target variable Y, the cost for 2 people to stay at an Airbnb location for 4 nights.

3.2.1 Price

mosaic::favstats(listings_clean$price)
 min  Q1 median      Q3   max     mean       sd    n missing
   0 609    850 1.2e+03 1e+05 1.09e+03 2.13e+03 9625       0

Interpretation: Since mean > median (about 1,090 vs 850), the distribution of ‘price’ is positively skewed. The minimum price is 0, which is odd because a property would not be listed for free; such values are most likely data errors.

3.2.2 Number of beds

mosaic::favstats(listings$beds)
 min  Q1 median  Q3 max mean   sd    n missing
   0   1      2   3  25 2.06 1.51 9562      63

Interpretation: Since mean > median (2.06 vs 2), the distribution of ‘beds’ is positively skewed. The minimum number of beds is 0, which is odd because every listed property should have at least 1 available bed. The maximum is 25 beds while the median is 2, so there are outliers on the high side.

3.2.3 Room capacity

mosaic::favstats(listings$accommodates)
 min  Q1 median  Q3 max mean   sd    n missing
   0   2      3   4  16 3.45 1.77 9625       0

Interpretation: Since mean > median (3.45 vs 3), the distribution of ‘accommodates’ is positively skewed. We want the price of 4 nights for 2 people, so listings that accommodate fewer than 2 people are not of interest. The minimum capacity is 0 and Q1 is 2, so this restriction excludes at most a quarter of the listings.

3.2.4 Minimum nights

mosaic::favstats(listings$minimum_nights)
 min  Q1 median  Q3      max mean   sd    n missing
   1   2      3   4 1.11e+03 4.59 20.8 9625       0

Interpretation: Since mean > median (4.59 vs 3), the distribution of ‘minimum_nights’ is positively skewed. Our target variable is the price of a 4-night stay, so we exclude all listings with a minimum_nights requirement of more than 4 from our analysis. The maximum value of minimum_nights is 1,111, which is odd because there are only 365 days in a year.

3.2.5 Maximum nights

mosaic::favstats(listings$maximum_nights)
 min  Q1 median       Q3   max mean  sd    n missing
   1  20    365 1.12e+03 4e+03  564 536 9625       0

Interpretation: Since mean > median (564 vs 365), the distribution of ‘maximum_nights’ is positively skewed. We need maximum_nights to exclude properties that only allow stays of fewer than 4 nights, because our target variable requires a stay of at least 4 nights.
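The three restrictions described in this section (capacity of at least 2, minimum stay of at most 4 nights, maximum stay of at least 4 nights) can be sketched as a single filter. The data frame `toy` is made up, and defining `price_4_nights` as four times the nightly price is an assumption for illustration; the report does not show how the target variable was actually built, and any extra fees are ignored here.

```r
# Made-up listings; each of the first three rows violates exactly one restriction.
toy <- data.frame(
  price          = c(500, 800, 1200, 650),
  accommodates   = c(1, 2, 4, 2),
  minimum_nights = c(1, 2, 7, 4),
  maximum_nights = c(30, 3, 365, 10)
)

# Keep listings that can host 2 people for exactly 4 nights.
keep <- toy$accommodates >= 2 &
  toy$minimum_nights <= 4 &
  toy$maximum_nights >= 4
stay <- toy[keep, ]

# ASSUMPTION: total cost = 4 x nightly price (no cleaning or guest fees).
stay$price_4_nights <- 4 * stay$price
print(stay)
```

Only the last toy listing survives, with a 4-night price of 2600.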

3.3 Data visualisations

3.3.1 Examining Distribution of price

# binwidth must be passed to geom_histogram(); ggplot() ignores it
listings_clean%>%
  ggplot(aes(x=price))+
  geom_histogram(binwidth=10)+
  theme_minimal()+
  ggtitle("Histogram of price")+
  NULL

listings_clean%>%
  filter(price<=3500) %>%
  ggplot(aes(x=price)) +
  geom_histogram(binwidth=10, alpha=0.7, colour="black")+
  theme_bw()+
  ggtitle("Histogram of price less than 3500")+
  NULL

3.3.2 Exploring numerical variables of interest

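library(purrr)
library(tidyr) # pivot_longer()

listings_numeric <- listings %>%
  select(minimum_nights, maximum_nights, review_scores_rating, number_of_reviews)%>%
  # trim extreme values so that the bulk of each distribution is visible
  filter(minimum_nights<=30, maximum_nights<=1600, number_of_reviews<=200)

listings_numeric %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    # draw the histogram on the density scale so the density curve is comparable
    geom_histogram(aes(y = after_stat(density)), color="black", fill="pink")+
    geom_density(alpha=0.5)+
    theme_bw()+
    NULL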

3.3.3 Exploring factor variables using geom_bar

Logical variables that are factor variables: 1. host_is_superhost 2. host_has_profile_pic 3. host_identity_verified 4. has_availability 5. instant_bookable

Character variables that are factor variables: 1. property_type 2. room_type 3. bathrooms_text 4. host_response_time 5. host_neighbourhood

# plotting bar graph of host_is_superhost
listings%>%
  filter(!is.na(host_is_superhost))%>%
  ggplot(aes(x=host_is_superhost))+
  geom_bar()+
  theme_bw()

# plotting bar graph of instant_bookable 
listings%>%
  filter(!is.na(instant_bookable))%>%
  ggplot(aes(x=instant_bookable))+
  geom_bar()+
  theme_bw()

3.3.4 Exploring factor variables using geom_col

##average review_scores_rating for each host_is_superhost level
listings%>%
  filter(!is.na(host_is_superhost))%>%
  group_by(host_is_superhost)%>%
  # geom_col() stacks raw values, so summarise to one mean per group first
  summarise(mean_rating = mean(review_scores_rating, na.rm = TRUE))%>%
  ggplot(aes(x=host_is_superhost, y=mean_rating))+
  geom_col()+
  theme_bw()

##total number_of_reviews for each host_identity_verified level
listings%>%
  filter(!is.na(host_identity_verified))%>%
  ggplot(aes(x=host_identity_verified, y=number_of_reviews)) +
  geom_col() +
  theme_bw()

3.4 Correlations

Interpretation:

The correlations between price and the other numeric variables are surprisingly low, probably because the numerical data fail to capture intangible factors such as location and marketing (e.g. quality of the description). The variable review_scores_location has a correlation of only 0.025 with price. Whilst this is negligible, the relationship could be explored later through a regression.

As seen from the chunk of code below, variables relating to reviews do not correlate with a higher price. This appears reasonable, as a property with a large number of reviews and a high overall rating could still be a low-priced property and vice versa. Within the group, the correlation between the different rating sub-scores (e.g. review_scores_cleanliness and review_scores_communication) is surprisingly low at 0.51.

Some variables are closely related to each other by construction. For instance, “accommodates” and “beds” have a correlation of 0.746. This is sensible, as the number of guests a listing accommodates depends on its beds and bedrooms. The pair closest to a perfect correlation, at 0.999, is “calculated_host_listings_count” and “calculated_host_listings_count_entire_homes”.

# we expected these variables to be strongly correlated; however, they are not
listings_clean %>%
  select(price, number_of_reviews, review_scores_rating, review_scores_accuracy,
         review_scores_cleanliness, review_scores_communication, review_scores_value,
         review_scores_location, reviews_per_month) %>%
  ggpairs(alpha = 0.4)

Interpretation: From the chunk below, we find stronger correlations between price and accommodates, beds, bedrooms, availability_30, availability_60, availability_90 and availability_365. However, the variables within group (1) accommodates, beds and bedrooms and within group (2) availability_30, availability_60, availability_90 and availability_365 are highly correlated with each other. Therefore, we attempt to remove the noise by carrying forward only accommodates (correlation of 0.151 with price) and availability_30 (correlation of 0.108 with price).

listings_clean %>%
select(c(price, accommodates, beds, bedrooms, availability_30, availability_60,availability_90,availability_365)) %>%
  ggpairs(alpha = 0.4)

Interpretation: Finally, we test other variables that we do not suspect to have a great influence on price. Surprisingly, “calculated_host_listings_count” and “calculated_host_listings_count_entire_homes” have an almost perfect correlation of 0.999.

listings_clean %>%
  select(price, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d,
         calculated_host_listings_count, calculated_host_listings_count_entire_homes,
         calculated_host_listings_count_private_rooms,
         calculated_host_listings_count_shared_rooms, reviews_per_month) %>%
  ggpairs(alpha = 0.4)
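When reading the correlations in this section, it helps to keep in mind that a correlation r is not the share of variation explained; in a simple regression with one predictor, that share is r squared. The sketch below illustrates this on synthetic data (the vectors `x` and `y` are assumptions for illustration) and also shows one way to handle missing values, which matter here because the review scores are missing for many listings.

```r
# Synthetic data with a weak linear relationship and a few missing values.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.2 * x + rnorm(n)
y[c(3, 17)] <- NA  # mimic listings without review scores

# Pearson correlation using only the rows where both values are present.
r <- cor(x, y, use = "pairwise.complete.obs")

# R-squared of the simple regression of y on x (lm() drops incomplete rows).
r2_from_lm <- summary(lm(y ~ x))$r.squared

# In a one-predictor regression, R-squared equals the squared correlation.
print(isTRUE(all.equal(r^2, r2_from_lm)))  # TRUE

# A correlation of 0.025 therefore corresponds to a tiny explained share.
print(0.025^2)  # 0.000625
```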

3.5 Property types

3.6 Examining property types

head(listings_clean %>%
  group_by(property_type) %>%
  summarise(count=n()) %>%
  mutate(proportion=count/sum(count)) %>%
  arrange(desc(count))
)
property_type                count proportion
Entire rental unit            5803     0.603
Entire condominium (condo)    1200     0.125
Private room in rental unit    972     0.101
Entire residential home        465     0.0483
Entire serviced apartment      274     0.0285
Entire townhouse               206     0.0214

Interpretation: We can see that the four most common property types are “Entire rental unit”, “Entire condominium (condo)”, “Private room in rental unit” and “Entire residential home”. Together they account for 8,440 listings, which corresponds to about 87.7% of the total. For simplicity, we will group the remaining ~12% of listings in the category “Other” by creating a new variable called “prop_type_simplified”.
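The share quoted above can be checked with a few lines of arithmetic. The counts are copied from the table in this section and the total of 9,625 from the data summary; the variable names are just illustrative.

```r
# Counts of the six most common property types, copied from the table above.
counts <- c(
  "Entire rental unit"          = 5803,
  "Entire condominium (condo)"  = 1200,
  "Private room in rental unit" = 972,
  "Entire residential home"     = 465,
  "Entire serviced apartment"   = 274,
  "Entire townhouse"            = 206
)
n_total <- 9625  # total number of listings

# Listings covered by the four most common types and their share of the total.
top4        <- sum(sort(counts, decreasing = TRUE)[1:4])
share_top4  <- round(100 * top4 / n_total, 2)
share_other <- round(100 - share_top4, 2)

print(c(top4 = top4, share_top4 = share_top4, share_other = share_other))
# top4 = 8440, share_top4 = 87.69, share_other = 12.31
```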

3.6.1 Creating the variable prop_type_simplified

listings_prop <- listings_clean %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in residential home", "Entire residential home","Entire condominium (condo)") ~ property_type, 
    TRUE ~ "Other"
  ))
head(listings_prop %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))
)
property_type                prop_type_simplified           n
Entire rental unit           Entire rental unit          5803
Entire condominium (condo)   Entire condominium (condo)  1200
Private room in rental unit  Other                        972
Entire residential home      Entire residential home      465
Entire serviced apartment    Other                        274
Entire townhouse             Other                        206

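Interpretation: From the table above we can see that every property type not named in the code is assigned to the category “Other” in the prop_type_simplified variable. Note that the code names “Private room in residential home” rather than “Private room in rental unit”, so the third most common property type (972 listings) also falls into “Other”.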
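One way to avoid typing category names by hand is to derive the most common types from the data and lump everything else into “Other”. The following is a minimal base-R sketch of that idea, not the approach used in the report; the vector `property_type` is made up, and `n_top` is set to 2 only to keep the toy example small (the report keeps four categories).

```r
# Made-up property types: 5 + 3 + 2 + 1 + 1 listings.
property_type <- c(
  rep("Entire rental unit", 5),
  rep("Entire condominium (condo)", 3),
  rep("Private room in rental unit", 2),
  "Entire townhouse",
  "Boat"
)

# Number of categories to keep; everything else becomes "Other".
n_top <- 2

# Frequency table sorted from most to least common, then the top names.
freq      <- sort(table(property_type), decreasing = TRUE)
top_types <- names(freq)[seq_len(n_top)]

# Keep the label for the top categories, lump the rest into "Other".
prop_type_simplified <- ifelse(property_type %in% top_types, property_type, "Other")

print(table(prop_type_simplified))
```

Because the kept categories are computed rather than typed, a spelling mismatch between the list of names and the data cannot silently move a large category into “Other”.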

3.6.2 Examining the relationship between price and property type

# barplot for property_type vs average price 

listings_prop %>%
  group_by(prop_type_simplified)%>%
  summarise(average_price = mean(price)) %>%
  ggplot(aes(x=prop_type_simplified,y=average_price))+
  geom_col()+
  ggtitle("Average Property Price vs Property Type")+
  geom_text(aes(label = c(1061.12,1125.19,1354.68,954.92,473.51)), vjust = 1.5, colour = "white")+
  NULL

Interpretation: The bar chart above shows the average nightly price of the properties in each category of the variable prop_type_simplified. Properties of the type “Entire residential home” have the highest average price (DKK 1,354.68) and “Private room in residential home” has the lowest average price (DKK 473.51). These observations make intuitive sense, as an entire residential home should cost more than a single room in the same type of home.
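The bar labels in the chunk above are typed in by hand, which means they can drift out of sync with the data or with the order of the bars. A possible alternative is to compute the labels from the summarised data. The sketch below shows the idea in base R on made-up prices; `toy`, `avg` and the `label` column are illustrative names, not objects from the report.

```r
# Made-up nightly prices for two simplified property types.
toy <- data.frame(
  prop_type_simplified = c("Entire rental unit", "Entire rental unit", "Other"),
  price                = c(1000, 1200, 400)
)

# Average price per category (one row per category, sorted by category name).
avg <- aggregate(price ~ prop_type_simplified, data = toy, FUN = mean)
names(avg)[2] <- "average_price"

# Labels derived from the data, formatted with two decimals.
avg$label <- sprintf("%.2f", avg$average_price)
print(avg)
```

In a ggplot2 chart the same column could then be mapped with `aes(label = label)` in `geom_text()`, so that each label always belongs to its own bar.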

listings_prop %>%
  select(price,review_scores_rating, review_scores_location, review_scores_value,
         number_of_reviews, reviews_per_month,
         bedrooms,beds, availability_365) %>%
  ggpairs(alpha=0.5)+
  theme_bw()

Interpretation: “review_scores_value” and “review_scores_rating” have the highest correlation, at 0.741. This result makes sense as the two variables measure closely related things, namely the visitors’ satisfaction with the accommodation.

3.6.3 Listings that are intended for travel purposes

Questions:

  • What are the most common values for the variable minimum_nights?
  • Is there any value among the common values that stands out?
  • What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

Filter the Airbnb data so that it only includes observations with minimum_nights <= 4

#most common value for minimum_nights
listings_prop %>%
  group_by(minimum_nights) %>%
  count(sort=TRUE)
# A tibble: 66 × 2
# Groups:   minimum_nights [66]
   minimum_nights     n
            <dbl> <int>
 1              2  2871
 2              3  2149
 3              1  1757
 4              4  1017
 5              5   790
 6              7   368
 7              6   231
 8             14    86
 9             30    75
10             10    46
# … with 56 more rows
summary(as.factor(listings$minimum_nights))
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
1757 2871 2149 1017  790  231  368   13    4   46    3   13   10   86   19    6 
  18   19   20   21   22   23   25   27   28   29   30   31   35   36   39   40 
   3    1   25   13    2    1    8    1   11    4   75    4    4    1    1    3 
  44   45   49   50   56   60   61   66   69   70   75   80   85   89   90   92 
   1    4    1    5    1   18    1    1    1    3    1    2    1    1   17    1 
  99  100  105  110  120  150  160  170  180  200  300  360  365  400  500  600 
   1    3    1    1    1    2    1    1    4    2    2    1    1    1    1    1 
1000 1111 
   1    1 

Interpretation: The most common value for “minimum_nights” among all listings is 2 nights (2,871 listings), followed by 3 nights and 1 night. The value that stands out is 30 nights (75 listings): it is surprising for listings intended for travel purposes, considering that many employers do not allow that many consecutive vacation days. Most likely, these listings are intended for people who live in the city for a limited period, for example for an internship, and need accommodation for that time.

#filter for listings with a minimum stay of at most 4 nights 
listings_less_than_4 <- listings_prop %>%
  filter(minimum_nights <= 4)
#correlation (ggpairs with the filter for less than or equal to 4 nights)
listings_less_than_4 %>%
 select(price,review_scores_rating, review_scores_location, review_scores_value,
         number_of_reviews, reviews_per_month,
         bedrooms,beds, availability_365) %>%
  ggpairs(alpha=0.5)+
  theme_bw()
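To see how much data the minimum_nights <= 4 filter keeps, the counts from the frequency table above can be added up directly. This is only a consistency check on numbers already shown in the report; the variable names are illustrative.

```r
# Listings with minimum_nights of 1, 2, 3 and 4, copied from the table above.
n_by_min_nights <- c(`1` = 1757, `2` = 2871, `3` = 2149, `4` = 1017)
n_total <- 9625  # total number of listings

# Listings that survive the filter and their share of the full data set.
n_kept     <- sum(n_by_min_nights)
share_kept <- round(100 * n_kept / n_total, 1)

print(c(n_kept = n_kept, share_kept = share_kept))  # 7794 and 81
```

So the filter retains 7,794 of the 9,625 listings, or about 81% of the data.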

4 Mapping

4.1 Map

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)
listings%>%
  filter(minimum_nights<=4)%>%
  group_by(neighbourhood_cleansed)%>%
  count()
# A tibble: 11 × 2
# Groups:   neighbourhood_cleansed [11]
   neighbourhood_cleansed        n
   <chr>                     <int>
 1 Amager st                   574
 2 Amager Vest                 739
 3 Bispebjerg                  276
 4 Brnshj-Husum                143
 5 Frederiksberg               780
 6 Indre By                   1414
 7 Nrrebro                    1250
 8 sterbro                     799
 9 Valby                       314
10 Vanlse                      184
11 Vesterbro-Kongens Enghave  1321
listings%>%
  filter(minimum_nights<=4)%>%
  ggplot(aes(x=neighbourhood_cleansed))+
  geom_bar()+
  theme(axis.text.x=element_text(angle=70, size=8, vjust=0.6))+
  labs(title="Distribution of Airbnb homes among neighborhoods",x="Neighbourhood",y="Count")+
  NULL

Interpretation: Looking at the map and the distribution of listings across all neighborhoods, we can observe that, unsurprisingly, listings are concentrated in the city center and become scarcer the further we move out. Indre By has the most Airbnb listings of all (1414 listings), followed by Nørrebro with 1250 listings.

4.2 Relationship between price and neighbourhood

listings_clean_map<-listings_clean%>%
  group_by(neighbourhood_cleansed) %>%                         
  summarise_at(vars(price),              
              list(name = mean))

listings_clean_map%>%
  ggplot(aes(x=neighbourhood_cleansed,y= name))+
  geom_point()+
  geom_text(aes(label = c(1000.77,1101.55,695.19,815.97,1143.60,1488.38,885.83,1071.06,921.62,841.43,1050.33)), vjust = 1.5, colour = "black")+
  labs(title="Average rental price per night in each neighbourhood",x="Neighbourhood",y="Average price per night")+
  #theme_bw() is a complete theme, so it has to come before the axis text tweak
  theme_bw()+
  theme(axis.text.x=element_text(angle=70, size=8, vjust=0.6))+
  NULL

listings_clean%>%
  ggplot(aes(x=neighbourhood_cleansed))+
  geom_boxplot(aes(y= price))+
  labs(title="Range of rental price per night in each neighbourhood",x="Neighbourhood",y="Price per night")+
  #theme_bw() is a complete theme, so it has to come before the axis text tweak
  theme_bw()+
  theme(axis.text.x=element_text(angle=70, size=8, vjust=0.6))+
  NULL

Interpretation: Looking at the average rental price in each neighbourhood, it becomes clear that Indre By is by far the neighborhood with the highest average rental price (~DKK 1490 per night). The cheapest Airbnbs, on average, are in Bispebjerg (~DKK 695 per night). Looking at the range of prices in each area, we observe some large outliers (probably due to incorrect reporting by hosts), which noticeably influence the average prices.

5 Setting up for Regression Analysis

To start with, our target variable will be the cost for two people to stay at an Airbnb location for four nights.

5.1 Creating variable ‘price_4_nights’

#creating our output variable "price_4_nights"
regression_df_1 <- listings_prop %>%
  filter(accommodates >= 2, minimum_nights <= 4, maximum_nights >= 4) %>%
  mutate(price_4_nights = price*4) %>%
  filter(!is.na(price_4_nights)) # removing any missing values , but none seem to exist 

Note: To define price_4_nights, we filtered for listings that accommodate at least 2 people (>=2), because 2 people can also stay in a listing with a capacity of 3 or more.

Secondly, since our guests want to stay for 4 nights, we filtered for a minimum-nights requirement of less than or equal to 4 (<=4), because our guests can stay at listings whose minimum stay is between 1 and 4 nights.

Moreover, we filtered for a maximum-nights limit of greater than or equal to 4 (>=4), because our guests have to be allowed to stay for at least 4 nights. The logic behind these filters is that we want to remove pricing data for listings that our guests could not book.
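
The three conditions can be checked on a small made-up table. The sketch below uses base R rather than dplyr; the `toy` data frame and all of its values are hypothetical and only serve to show which rows survive the filters.

```r
# Hypothetical listings, for illustration only (not taken from the Copenhagen data)
toy <- data.frame(
  id             = 1:5,
  accommodates   = c(1, 2, 4, 2, 3),
  minimum_nights = c(1, 2, 30, 4, 5),
  maximum_nights = c(30, 3, 365, 1125, 60),
  price          = c(400, 800, 1200, 950, 700)
)

# Keep only listings that two guests can book for exactly four nights
keep <- toy$accommodates >= 2 &
  toy$minimum_nights <= 4 &
  toy$maximum_nights >= 4
feasible <- toy[keep, ]

# The four-night price is the nightly price times four
feasible$price_4_nights <- feasible$price * 4
print(feasible)
```

Listing 1 is too small, listing 2 does not allow a four-night stay, and listings 3 and 5 require a longer minimum stay, so only listing 4 remains.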

5.2 Examining distribution of price_4_nights

regression_df_1 %>%
  ggplot(aes(price_4_nights)) +
  geom_histogram(color="black", fill="grey")+
  theme_bw()+
  geom_density(alpha=0.5) +
  NULL

regression_df_2 <- regression_df_1 %>%
  mutate(log_price_4_nights = log(price_4_nights))

regression_df_2 %>%
  ggplot(aes(log_price_4_nights)) +
  geom_histogram(color="black", fill="pink")+
  theme_bw()+
  geom_density(alpha=0.2) +
  NULL

Interpretation:

Going forward we will use log_price_4_nights. As the histograms above show, price_4_nights is strongly right-skewed, whereas its logarithm is roughly symmetric and close to normally distributed. Strictly speaking, OLS does not require the variables themselves to be normally distributed; what matters for inference is that the residuals are approximately normal with constant variance. A heavily skewed response makes this unlikely, so reducing the skewness of price_4_nights by taking logs is a sensible first step.
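
The effect of a log transform on skewness can be illustrated with simulated data. The following is a minimal sketch under stated assumptions: the prices are drawn from a log-normal distribution with arbitrary parameters, and the `skewness` helper is written by hand for this example; neither comes from the analysis above.

```r
set.seed(42)
# Simulated, hypothetical right-skewed "prices" (log-normal), not the Airbnb data
price_sim <- rlnorm(1000, meanlog = 8, sdlog = 0.5)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(price_sim)
skew_log <- skewness(log(price_sim))
print(c(raw = skew_raw, logged = skew_log))
```

A skewness near zero for the logged series and a clearly positive value for the raw series mirrors what the two histograms show.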

6 Regression Analysis Part 1: Base specification

6.1 Model 1 - Type of listing

model1 <- lm(log_price_4_nights ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating, 
             data = regression_df_2)
  
summary(model1) 

Call:
lm(formula = log_price_4_nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = regression_df_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5087 -0.3199 -0.0543  0.2607  4.7275 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           8.2123361  0.0616266
prop_type_simplifiedEntire rental unit                0.0400720  0.0196014
prop_type_simplifiedEntire residential home           0.3175498  0.0357001
prop_type_simplifiedOther                            -0.2407395  0.0234648
prop_type_simplifiedPrivate room in residential home -0.5872703  0.0694131
number_of_reviews                                    -0.0003745  0.0001664
review_scores_rating                                 -0.0054367  0.0122712
                                                     t value Pr(>|t|)    
(Intercept)                                          133.260   <2e-16 ***
prop_type_simplifiedEntire rental unit                 2.044   0.0410 *  
prop_type_simplifiedEntire residential home            8.895   <2e-16 ***
prop_type_simplifiedOther                            -10.260   <2e-16 ***
prop_type_simplifiedPrivate room in residential home  -8.461   <2e-16 ***
number_of_reviews                                     -2.250   0.0245 *  
review_scores_rating                                  -0.443   0.6577    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5094 on 6455 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.07019,   Adjusted R-squared:  0.06933 
F-statistic: 81.22 on 6 and 6455 DF,  p-value: < 2.2e-16
car::vif(model1)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.027885  4        1.003444
number_of_reviews    1.024992  1        1.012419
review_scores_rating 1.005517  1        1.002755
autoplot(model1)+ theme_bw()

regression_df_2 %>%
  group_by(prop_type_simplified) %>%
  summarise(count=n())
prop_type_simplified              count
Entire condominium (condo)          970
Entire rental unit                 4647
Entire residential home             292
Other                              1440
Private room in residential home     65

Interpretation:

Having run a simple OLS regression, we can see that review_scores_rating is not a significant explanatory variable of log_price_4_nights at the 5% significance level. This follows from the t value close to zero (-0.443) and the correspondingly high p-value (0.658). Thus, when controlling for property type, review scores do not seem to affect prices in this simple model. This seems intuitive because a property’s review score should not directly affect a listing’s price but rather the willingness of a customer to book said listing. Hence, we would expect it to have a more direct relationship with something like the occupancy rate.

Moreover, we can see that all dummy variables derived from “prop_type_simplified” are significant at least at the 5% significance level. Thus, they all affect our dependent variable log_price_4_nights. This was to be expected as the size of a rental unit should be a critical contributing factor in determining price. When interpreting the sign of our property types we need to remind ourselves of the base case, which is “Entire condominium (condo)”. In light of this it makes sense that “private” and “other” rooms come at a discount while “entire” rental units and residential homes come at a premium.

We also observe that “number_of_reviews” is significant at the 5% significance level with a t-value of -2.25. However, the effect on price appears negligible when compared to property types.
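
Because the dependent variable is in logs, the coefficients are easier to compare once they are converted into percentage changes. The sketch below applies the standard log-linear conversion, exp(beta) - 1, to the estimates copied from the model1 output; the object names are introduced here for illustration only.

```r
# Estimates copied from the model1 output (base case: "Entire condominium (condo)")
coefs <- c(
  "Entire rental unit"               =  0.0400720,
  "Entire residential home"          =  0.3175498,
  "Other"                            = -0.2407395,
  "Private room in residential home" = -0.5872703
)

# In a log-linear model, exp(beta) - 1 is the relative price change versus the base case
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))

# Same conversion for one additional review (coefficient of number_of_reviews)
pct_per_review <- (exp(-0.0003745) - 1) * 100
print(round(pct_per_review, 3))
```

For example, the coefficient of 0.318 for entire residential homes corresponds to a premium of roughly 37% over a condominium, while one extra review changes the price by well under a tenth of a percent.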

6.2 Model 2 - Type of room

model2 <- lm(log_price_4_nights ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating + 
               room_type, 
             data = regression_df_2)
  
summary(model2) 

Call:
lm(formula = log_price_4_nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = regression_df_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5202 -0.2798 -0.0423  0.2304  4.8236 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           8.2318293  0.0568499
prop_type_simplifiedEntire rental unit                0.0363823  0.0180717
prop_type_simplifiedEntire residential home           0.3183529  0.0329133
prop_type_simplifiedOther                             0.3452995  0.0278946
prop_type_simplifiedPrivate room in residential home  0.3616344  0.0699177
number_of_reviews                                     0.0002402  0.0001547
review_scores_rating                                 -0.0117452  0.0113216
room_typeHotel room                                  -0.2837109  0.1197797
room_typePrivate room                                -0.9668193  0.0287123
room_typeShared room                                 -0.6852324  0.1676615
                                                     t value Pr(>|t|)    
(Intercept)                                          144.799  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 2.013   0.0441 *  
prop_type_simplifiedEntire residential home            9.672  < 2e-16 ***
prop_type_simplifiedOther                             12.379  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home   5.172 2.38e-07 ***
number_of_reviews                                      1.553   0.1206    
review_scores_rating                                  -1.037   0.2996    
room_typeHotel room                                   -2.369   0.0179 *  
room_typePrivate room                                -33.673  < 2e-16 ***
room_typeShared room                                  -4.087 4.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4697 on 6452 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.2101,    Adjusted R-squared:  0.209 
F-statistic: 190.6 on 9 and 6452 DF,  p-value: < 2.2e-16
car::vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.569426  4        1.125199
number_of_reviews    1.042645  1        1.021100
review_scores_rating 1.007005  1        1.003496
room_type            2.602351  3        1.172810
autoplot(model2)+ theme_bw()

regression_df_2 %>%
  group_by(room_type) %>%
  summarise(count=n())
room_type        count
Entire home/apt   6486
Hotel room          17
Private room       901
Shared room         10

Interpretation:

From the above we can see that all types of room have a statistically significant negative effect on log_price_4_nights at the 5% level. Private and shared room are both significant at the 1% level. The significant effects were to be expected as the type of room (private vs. shared) should be a big contributing factor in determining price. The signs of our variables have to be interpreted in conjunction with our base case. As can be seen from our room_type overview table, the base case is “Entire home/apt”, thus, it makes sense that all other room types command a reduction in price.

Moreover, as expected, adding room_type as an additional explanatory variable changes the estimates for our other coefficients. While all property types remain significant at the 5% level, number_of_reviews becomes insignificant. Overall, when controlling for property and room type, neither the review rating nor the number of reviews significantly affects our output variable.

6.3 Comparing Model 1 and Model 2

There are a couple of things to note here: review_scores_rating is insignificant in both models, and number_of_reviews is only weakly significant in Model 1 and insignificant in Model 2; both have negligible effects. Thus, neither appears to be an important explanatory variable of log_price_4_nights. However, for our subsequent analysis we will keep them as control variables, because leaving them in the error term could induce omitted variable bias. After all, beyond size and type of property, price will be most importantly affected by the underlying quality of the listing, for which reviews tend to be a good indicator. Thus, going forward, we will keep them in our model specification.

When looking at the VIF of prop_type_simplified we see that it increases markedly when introducing room_type (GVIF from about 1.03 to 2.57). This makes sense, since there is some overlap between these two variables: both give an indication of whether a listing is private or shared. However, the VIF is still small and within the acceptable range of below 5. Moreover, our adjusted R-squared goes up by a large margin, from roughly 7% to 21%. Thus, adding room_type drastically improves the explanatory power of our model, which is why we will keep it as a regressor going forward.
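
For factors with several levels, car::vif reports a generalized VIF (GVIF) together with the adjusted value GVIF^(1/(2*Df)). The sketch below recomputes that last column from the model2 output. Squaring the adjusted value before comparing it with the usual VIF threshold is a common rule of thumb that we assume here, not something taken from the analysis above.

```r
# GVIF and Df values copied from the car::vif(model2) output
gvif <- c(prop_type_simplified = 2.569426, number_of_reviews = 1.042645,
          review_scores_rating = 1.007005, room_type = 2.602351)
df   <- c(prop_type_simplified = 4, number_of_reviews = 1,
          review_scores_rating = 1, room_type = 3)

# Adjusted value shown in the last column of the car::vif output
gvif_adj <- gvif^(1 / (2 * df))

# Squared adjusted value, on a scale comparable to an ordinary VIF
print(round(cbind(gvif, df, gvif_adj, gvif_adj_sq = gvif_adj^2), 4))
```

Whichever of the two scales is used, all values stay well below 5.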

From the residual plots for Model 1 and Model 2 we can, however, clearly see that there is structure in the data that has not yet been accounted for. Moreover, both scale-location plots show clustering similar to that in the residual plots, indicating that the variability is not the same for all levels of price. This is why we will now investigate further explanatory variables that may or may not improve our model’s explanatory power.

7 Regression Analysis Part 2: Further variables

7.1 Model 3 - Size of the listing

Question: Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?

Note: In our data set, the number of bathrooms was not recorded across all observations. Thus, we cannot analyse this variable. Moreover, intuitively, the number of people a listing can accommodate will closely correlate with the listing’s number of beds and bedrooms. We will therefore first estimate a regression model that includes all three variables and then investigate their relationships further through a correlation analysis. We will then explain why we keep only one of them.

model3_0 <- lm(log_price_4_nights ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               beds +
               bedrooms +
               accommodates, 
             data=regression_df_2
             )

msummary(model3_0)
                                                       Estimate Std. Error
(Intercept)                                           7.799e+00  5.341e-02
prop_type_simplifiedEntire rental unit                1.586e-02  1.628e-02
prop_type_simplifiedEntire residential home          -1.006e-01  3.126e-02
prop_type_simplifiedOther                             1.406e-01  2.561e-02
prop_type_simplifiedPrivate room in residential home  1.059e-01  6.366e-02
number_of_reviews                                    -3.008e-05  1.395e-04
review_scores_rating                                 -1.507e-02  1.041e-02
room_typeHotel room                                  -5.468e-02  1.071e-01
room_typePrivate room                                -6.094e-01  2.747e-02
room_typeShared room                                 -4.199e-01  1.497e-01
beds                                                  7.102e-03  5.586e-03
bedrooms                                              1.160e-01  1.041e-02
accommodates                                          7.912e-02  5.738e-03
                                                     t value Pr(>|t|)    
(Intercept)                                          146.005  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 0.974  0.32986    
prop_type_simplifiedEntire residential home           -3.219  0.00129 ** 
prop_type_simplifiedOther                              5.489  4.2e-08 ***
prop_type_simplifiedPrivate room in residential home   1.663  0.09627 .  
number_of_reviews                                     -0.216  0.82929    
review_scores_rating                                  -1.448  0.14770    
room_typeHotel room                                   -0.510  0.60982    
room_typePrivate room                                -22.182  < 2e-16 ***
room_typeShared room                                  -2.806  0.00504 ** 
beds                                                   1.271  0.20362    
bedrooms                                              11.145  < 2e-16 ***
accommodates                                          13.789  < 2e-16 ***

Residual standard error: 0.4183 on 6306 degrees of freedom
  (1095 observations deleted due to missingness)
Multiple R-squared:  0.3735,    Adjusted R-squared:  0.3723 
F-statistic: 313.2 on 12 and 6306 DF,  p-value: < 2.2e-16
autoplot(model3_0)+ theme_bw()

car::vif(model3_0)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 3.014135  4        1.147877
number_of_reviews    1.054560  1        1.026918
review_scores_rating 1.007912  1        1.003948
room_type            2.891021  3        1.193553
beds                 2.492727  1        1.578837
bedrooms             3.337548  1        1.826896
accommodates         3.585671  1        1.893587
regression_df_2 %>%
  select(c(accommodates, beds, bedrooms)) %>%
  ggpairs(alpha = 0.3)

Interpretation:

Our regression that includes all 3 indicators of size shows that beds is insignificant at the 5% significance level. Moreover, when looking at the VIF, we can see that beds, bedrooms and accommodates have the highest factors in the model. We expect that by removing two out of the three variables, the VIF of the remaining variable will be significantly lower.

From the correlation analysis we can see that there is a relatively strong correlation across all three variables. This seems intuitive as they all give an indication of the same thing, namely the size of the listing. Therefore, going forward, we will only use accommodates as a proxy for size, as it has the highest correlation with the other two variables.
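
The link between correlated regressors and a high VIF can be made explicit: for a numeric regressor, VIF = 1 / (1 - R^2), where R^2 comes from regressing that variable on the other regressors. The sketch below computes this by hand on simulated data; the three simulated columns only borrow the variable names and are hypothetical, and `vif_by_hand` is a helper written for this example.

```r
set.seed(1)
n <- 500
# Simulated, hypothetical size variables that move together (not the Airbnb data)
accommodates <- sample(2:8, n, replace = TRUE)
bedrooms     <- pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.5)))
beds         <- pmax(1, round(accommodates / 1.5 + rnorm(n, sd = 0.7)))
sim <- data.frame(accommodates, bedrooms, beds)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other regressors
vif_by_hand <- function(var, data) {
  others <- setdiff(names(data), var)
  fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(sim), vif_by_hand, data = sim)
print(round(cor(sim), 2))
print(round(vifs, 2))
```

Dropping two of the three columns leaves nothing for the remaining one to be regressed on within this group, which is why its VIF falls once beds and bedrooms are removed.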

model3 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               accommodates, 
             data=regression_df_2
             )

msummary(model3)
                                                       Estimate Std. Error
(Intercept)                                           7.7777527  0.0523895
prop_type_simplifiedEntire rental unit                0.0217315  0.0162460
prop_type_simplifiedEntire residential home          -0.0298486  0.0308851
prop_type_simplifiedOther                             0.1749336  0.0254437
prop_type_simplifiedPrivate room in residential home  0.1494879  0.0630701
number_of_reviews                                    -0.0001015  0.0001393
review_scores_rating                                 -0.0086904  0.0101754
room_typeHotel room                                  -0.1529111  0.1077020
room_typePrivate room                                -0.6378892  0.0271346
room_typeShared room                                 -0.4974142  0.1507595
accommodates                                          0.1317883  0.0033617
                                                     t value Pr(>|t|)    
(Intercept)                                          148.460  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.338 0.181056    
prop_type_simplifiedEntire residential home           -0.966 0.333860    
prop_type_simplifiedOther                              6.875 6.77e-12 ***
prop_type_simplifiedPrivate room in residential home   2.370 0.017808 *  
number_of_reviews                                     -0.729 0.466319    
review_scores_rating                                  -0.854 0.393103    
room_typeHotel room                                   -1.420 0.155726    
room_typePrivate room                                -23.508  < 2e-16 ***
room_typeShared room                                  -3.299 0.000974 ***
accommodates                                          39.203  < 2e-16 ***

Residual standard error: 0.4221 on 6451 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.362, Adjusted R-squared:  0.3611 
F-statistic: 366.1 on 10 and 6451 DF,  p-value: < 2.2e-16
car::vif(model3)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.919242  4        1.143296
number_of_reviews    1.046742  1        1.023104
review_scores_rating 1.007064  1        1.003526
room_type            2.878473  3        1.192688
accommodates         1.227559  1        1.107953
autoplot(model3)

Interpretation:

From the regression output above, we can see that the VIF of accommodates has dropped significantly (from about 3.59 to 1.23). By removing beds and bedrooms we appear to have adequately addressed potential multicollinearity. Additionally, accommodates is a highly significant variable with a t-value of 39.2. This makes sense since the price of a listing will inevitably be affected by the number of people that can stay there. The sign of the effect is positive, which also seems intuitive as one would expect a larger property to command a higher price.

7.2 Model 4 - Superhosts

Question: Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

model4 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               accommodates +
               host_is_superhost, 
             data=regression_df_2
             )

msummary(model4)
                                                       Estimate Std. Error
(Intercept)                                           7.7770591  0.0528731
prop_type_simplifiedEntire rental unit                0.0216563  0.0162519
prop_type_simplifiedEntire residential home          -0.0298041  0.0308931
prop_type_simplifiedOther                             0.1748245  0.0254566
prop_type_simplifiedPrivate room in residential home  0.1491981  0.0630982
number_of_reviews                                    -0.0001104  0.0001452
review_scores_rating                                 -0.0086089  0.0103041
room_typeHotel room                                  -0.1525479  0.1077354
room_typePrivate room                                -0.6380485  0.0271534
room_typeShared room                                 -0.4972827  0.1507944
accommodates                                          0.1317985  0.0033635
host_is_superhostTRUE                                 0.0033037  0.0149302
                                                     t value Pr(>|t|)    
(Intercept)                                          147.089  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.333  0.18273    
prop_type_simplifiedEntire residential home           -0.965  0.33471    
prop_type_simplifiedOther                              6.868 7.14e-12 ***
prop_type_simplifiedPrivate room in residential home   2.365  0.01808 *  
number_of_reviews                                     -0.760  0.44706    
review_scores_rating                                  -0.835  0.40348    
room_typeHotel room                                   -1.416  0.15684    
room_typePrivate room                                -23.498  < 2e-16 ***
room_typeShared room                                  -3.298  0.00098 ***
accommodates                                          39.185  < 2e-16 ***
host_is_superhostTRUE                                  0.221  0.82488    

Residual standard error: 0.4222 on 6448 degrees of freedom
  (954 observations deleted due to missingness)
Multiple R-squared:  0.362, Adjusted R-squared:  0.3609 
F-statistic: 332.6 on 11 and 6448 DF,  p-value: < 2.2e-16
car::vif(model4)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.922263  4        1.143444
number_of_reviews    1.135800  1        1.065739
review_scores_rating 1.018817  1        1.009365
room_type            2.882026  3        1.192934
accommodates         1.227787  1        1.108055
host_is_superhost    1.118070  1        1.057388
autoplot(model4)

Interpretation:

From the above output we can see a couple of things: First, host_is_superhost is not a significant explanatory variable for price_4_nights at the 5% significance level (p = 0.82). Secondly, while the VIF shows no signs of collinearity, R-squared has not changed compared to our latest model (model3) and the residual standard error even increases slightly (from 0.4221 to 0.4222). We also do not see any changes in our residual and normal Q-Q plots. This all indicates that host_is_superhost is not adding any new information to the model. We also do not suspect that host_is_superhost accounts for any omitted variable bias that would make it necessary to include it as a control variable. Overall, for the reasons above we will drop this variable going forward.
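
Why an uninformative regressor lowers the adjusted R-squared can be seen from its formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), which penalises every additional predictor. The sketch below plugs in the values printed in the model3 and model4 summaries; because the printed R-squared is rounded to three decimals, the results only match the reported figures approximately. The helper `adj_r2` is introduced here for illustration.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# model3: R^2 = 0.362, p = 10 coefficients besides the intercept,
# 6451 residual degrees of freedom, hence n = 6451 + 10 + 1 = 6462
adj_model3 <- adj_r2(0.362, n = 6462, p = 10)

# model4: same printed R^2, one more coefficient, 6448 residual degrees of freedom,
# hence n = 6448 + 11 + 1 = 6460
adj_model4 <- adj_r2(0.362, n = 6460, p = 11)

print(round(c(model3 = adj_model3, model4 = adj_model4), 4))
```

With an unchanged R-squared, the extra coefficient can only push the adjusted value down, which is the pattern seen when moving from model3 to model4.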

7.3 Model 5 - Instant bookable

Question: Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?

model5 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               accommodates +
               instant_bookable, 
             data=regression_df_2
             )

msummary(model5)
                                                       Estimate Std. Error
(Intercept)                                           7.7778975  0.0525623
prop_type_simplifiedEntire rental unit                0.0217313  0.0162473
prop_type_simplifiedEntire residential home          -0.0298759  0.0308977
prop_type_simplifiedOther                             0.1749135  0.0254524
prop_type_simplifiedPrivate room in residential home  0.1494817  0.0630753
number_of_reviews                                    -0.0001013  0.0001394
review_scores_rating                                 -0.0087098  0.0101917
room_typeHotel room                                  -0.1526379  0.1080024
room_typePrivate room                                -0.6378270  0.0271968
room_typeShared room                                 -0.4973088  0.1508023
accommodates                                          0.1317959  0.0033692
instant_bookableTRUE                                 -0.0004678  0.0135925
                                                     t value Pr(>|t|)    
(Intercept)                                          147.975  < 2e-16 ***
prop_type_simplifiedEntire rental unit                 1.338  0.18109    
prop_type_simplifiedEntire residential home           -0.967  0.33362    
prop_type_simplifiedOther                              6.872 6.92e-12 ***
prop_type_simplifiedPrivate room in residential home   2.370  0.01782 *  
number_of_reviews                                     -0.727  0.46748    
review_scores_rating                                  -0.855  0.39281    
room_typeHotel room                                   -1.413  0.15762    
room_typePrivate room                                -23.452  < 2e-16 ***
room_typeShared room                                  -3.298  0.00098 ***
accommodates                                          39.118  < 2e-16 ***
instant_bookableTRUE                                  -0.034  0.97254    

Residual standard error: 0.4221 on 6450 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.362, Adjusted R-squared:  0.361 
F-statistic: 332.8 on 11 and 6450 DF,  p-value: < 2.2e-16
car::vif(model5)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.923420  4        1.143501
number_of_reviews    1.048260  1        1.023846
review_scores_rating 1.010129  1        1.005052
room_type            2.903501  3        1.194411
accommodates         1.232844  1        1.110335
instant_bookable     1.020213  1        1.010056
autoplot(model5)

Interpretation:

From the above output we can see that instant_bookable is not a significant explanatory variable for price_4_nights at the 5% significance level. Moreover, as for host_is_superhost, we do not see any improvement in the explanatory power of our model, as R-squared and the residual standard error do not change. Given that (instant_bookable == FALSE) only affects a very small fraction of our observations, this was to be expected. Regarding collinearity, there seems to be no significant correlation between instant_bookable and the other explanatory variables, as all GVIF values are less than 5. For these reasons, we will not consider this variable going forward.

7.4 Model 6 - Neighbourhoods

Determining whether location is a predictor of price_4_nights.

7.4.1 Creating new variable neighbourhood_cleansed_simplified

regression_df_clean_neighbourhood <- regression_df_2 %>%
  mutate(neighbourhood_cleansed_simplified = case_when(
    neighbourhood_cleansed %in% c("Indre By",
                                  "Vesterbro-Kongens Enghave", 
                                  "Nrrebro","sterbro", 
                                  "Frederiksberg", 
                                  "Amager Vest") ~ neighbourhood_cleansed, 
    TRUE ~ "Other"
  ))

regression_df_clean_neighbourhood %>%
  count(neighbourhood_cleansed_simplified) %>%
  arrange(desc(n)) 
neighbourhood_cleansed_simplified     n
Other                              1386
Indre By                           1374
Vesterbro-Kongens Enghave          1253
Nrrebro                            1206
sterbro                             760
Frederiksberg                       739
Amager Vest                         696

7.4.2 Running Model 6

model6 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               accommodates +
               neighbourhood_cleansed_simplified, 
             data=regression_df_clean_neighbourhood
             )

msummary(model6)
                                                             Estimate
(Intercept)                                                 7.7667492
prop_type_simplifiedEntire rental unit                      0.0086385
prop_type_simplifiedEntire residential home                 0.1004274
prop_type_simplifiedOther                                   0.1376155
prop_type_simplifiedPrivate room in residential home        0.2509233
number_of_reviews                                          -0.0005483
review_scores_rating                                       -0.0047721
room_typeHotel room                                        -0.2417652
room_typePrivate room                                      -0.5815066
room_typeShared room                                       -0.3961472
accommodates                                                0.1254778
neighbourhood_cleansed_simplifiedFrederiksberg              0.0415391
neighbourhood_cleansed_simplifiedIndre By                   0.2968230
neighbourhood_cleansed_simplifiedNrrebro                   -0.0399046
neighbourhood_cleansed_simplifiedOther                     -0.1981164
neighbourhood_cleansed_simplifiedsterbro                    0.0422716
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave  0.0461336
                                                           Std. Error t value
(Intercept)                                                 0.0515102 150.781
prop_type_simplifiedEntire rental unit                      0.0152074   0.568
prop_type_simplifiedEntire residential home                 0.0296323   3.389
prop_type_simplifiedOther                                   0.0239592   5.744
prop_type_simplifiedPrivate room in residential home        0.0595529   4.213
number_of_reviews                                           0.0001315  -4.169
review_scores_rating                                        0.0095271  -0.501
room_typeHotel room                                         0.1008744  -2.397
room_typePrivate room                                       0.0255720 -22.740
room_typeShared room                                        0.1411430  -2.807
accommodates                                                0.0031598  39.711
neighbourhood_cleansed_simplifiedFrederiksberg              0.0228024   1.822
neighbourhood_cleansed_simplifiedIndre By                   0.0202756  14.639
neighbourhood_cleansed_simplifiedNrrebro                    0.0205713  -1.940
neighbourhood_cleansed_simplifiedOther                      0.0199886  -9.911
neighbourhood_cleansed_simplifiedsterbro                    0.0227066   1.862
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave  0.0205580   2.244
                                                           Pr(>|t|)    
(Intercept)                                                 < 2e-16 ***
prop_type_simplifiedEntire rental unit                     0.570025    
prop_type_simplifiedEntire residential home                0.000705 ***
prop_type_simplifiedOther                                  9.68e-09 ***
prop_type_simplifiedPrivate room in residential home       2.55e-05 ***
number_of_reviews                                          3.10e-05 ***
review_scores_rating                                       0.616462    
room_typeHotel room                                        0.016572 *  
room_typePrivate room                                       < 2e-16 ***
room_typeShared room                                       0.005020 ** 
accommodates                                                < 2e-16 ***
neighbourhood_cleansed_simplifiedFrederiksberg             0.068547 .  
neighbourhood_cleansed_simplifiedIndre By                   < 2e-16 ***
neighbourhood_cleansed_simplifiedNrrebro                   0.052446 .  
neighbourhood_cleansed_simplifiedOther                      < 2e-16 ***
neighbourhood_cleansed_simplifiedsterbro                   0.062699 .  
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.024861 *  

Residual standard error: 0.3948 on 6445 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.4423,    Adjusted R-squared:  0.4409 
F-statistic: 319.5 on 16 and 6445 DF,  p-value: < 2.2e-16
car::vif(model6)
                                      GVIF Df GVIF^(1/(2*Df))
prop_type_simplified              3.232083  4        1.157938
number_of_reviews                 1.066288  1        1.032612
review_scores_rating              1.008979  1        1.004480
room_type                         2.939444  3        1.196862
accommodates                      1.239477  1        1.113318
neighbourhood_cleansed_simplified 1.172223  6        1.013330
autoplot(model6)

Interpretation:

Compared with our previous preferred model (model3), the adjusted R2 increases by roughly 0.08 (from 0.361 to 0.441) and the residual standard error falls from 0.4221 to 0.3948. Moreover, all of our neighbourhood dummy variables are significant explanatory variables at the 10% level. The large gain in explanatory power from adding this variable makes intuitive sense. As with real estate, prices for accommodation vary by area: some neighbourhoods command a price premium because of their proximity to the city centre, others because they are clean and affluent.
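Because the dependent variable is log(price_4_nights), each neighbourhood coefficient can be read as an approximate percentage difference relative to the omitted reference neighbourhood. The following is a minimal sketch of that conversion; the coefficient values and level names are copied as printed in the model6 output above, while the object names coefs and pct_premium are introduced here purely for illustration.

```r
# Neighbourhood coefficients copied from the model6 output (log-price scale).
# Level names are kept exactly as they appear in the printed output.
coefs <- c(
  "Frederiksberg"             =  0.0415391,
  "Indre By"                  =  0.2968230,
  "Nrrebro"                   = -0.0399046,
  "Other"                     = -0.1981164,
  "sterbro"                   =  0.0422716,
  "Vesterbro-Kongens Enghave" =  0.0461336
)

# In a log-linear model, exp(beta) - 1 is the proportional price difference
# relative to the reference category, holding the other regressors fixed.
pct_premium <- (exp(coefs) - 1) * 100

round(pct_premium, 1)
```

Under this reading, a listing in Indre By is priced roughly a third higher than an otherwise comparable listing in the reference neighbourhood.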

Additionally, multicollinearity does not appear to be an issue, given that all GVIFs remain below 5 and neighbourhood_cleansed_simplified itself has a GVIF of only 1.17. Lastly, the residuals vs fitted and scale-location plots now look almost random. This effect on our model was to be expected given the explanatory power of neighbourhood and the way listings are distributed across its categories. For model3, one could almost make out patterns of vertical lines; adding information on location has seemingly removed a large part of these remaining patterns.
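For terms with more than one degree of freedom, car::vif reports a generalized VIF alongside the adjusted value GVIF^(1/(2*Df)), which makes terms with different numbers of dummies comparable. As a check on the printed figures for neighbourhood_cleansed_simplified (6 degrees of freedom):

```latex
\mathrm{GVIF}^{1/(2\,\mathrm{Df})}
  = 1.172223^{1/(2 \times 6)}
  = 1.172223^{1/12}
  \approx 1.0133
```

Squaring the adjusted value (about 1.03) puts it on the scale of an ordinary VIF. Comparing that squared value with the threshold of 5 is an assumption about how to apply the rule of thumb, not something stated in the analysis above, but on either reading the term is far from problematic.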

Hence, going forward, model6 is now our preferred model.

7.5 Model 7 - Availability

Note: availability_30 is the number of days on which the listing is available over the next 30 days, as determined by its calendar.

model7 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               accommodates +
               neighbourhood_cleansed_simplified+
               availability_30, 
             data=regression_df_clean_neighbourhood
             )

msummary(model7)
                                                             Estimate
(Intercept)                                                 7.5357940
prop_type_simplifiedEntire rental unit                      0.0097386
prop_type_simplifiedEntire residential home                 0.0999257
prop_type_simplifiedOther                                   0.1270435
prop_type_simplifiedPrivate room in residential home        0.2199067
number_of_reviews                                          -0.0004349
review_scores_rating                                        0.0227773
room_typeHotel room                                        -0.5402829
room_typePrivate room                                      -0.6107811
room_typeShared room                                       -0.6143250
accommodates                                                0.1260141
neighbourhood_cleansed_simplifiedFrederiksberg              0.0547302
neighbourhood_cleansed_simplifiedIndre By                   0.2729825
neighbourhood_cleansed_simplifiedNrrebro                   -0.0112098
neighbourhood_cleansed_simplifiedOther                     -0.1901813
neighbourhood_cleansed_simplifiedsterbro                    0.0377509
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave  0.0704718
availability_30                                             0.0168311
                                                           Std. Error t value
(Intercept)                                                 0.0481318 156.566
prop_type_simplifiedEntire rental unit                      0.0140600   0.693
prop_type_simplifiedEntire residential home                 0.0273964   3.647
prop_type_simplifiedOther                                   0.0221537   5.735
prop_type_simplifiedPrivate room in residential home        0.0550673   3.993
number_of_reviews                                           0.0001217  -3.574
review_scores_rating                                        0.0088475   2.574
room_typeHotel room                                         0.0936979  -5.766
room_typePrivate room                                       0.0236590 -25.816
room_typeShared room                                        0.1306594  -4.702
accommodates                                                0.0029214  43.135
neighbourhood_cleansed_simplifiedFrederiksberg              0.0210856   2.596
neighbourhood_cleansed_simplifiedIndre By                   0.0187596  14.552
neighbourhood_cleansed_simplifiedNrrebro                    0.0190389  -0.589
neighbourhood_cleansed_simplifiedOther                      0.0184820 -10.290
neighbourhood_cleansed_simplifiedsterbro                    0.0209937   1.798
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave  0.0190210   3.705
availability_30                                             0.0005084  33.105
                                                           Pr(>|t|)    
(Intercept)                                                 < 2e-16 ***
prop_type_simplifiedEntire rental unit                     0.488558    
prop_type_simplifiedEntire residential home                0.000267 ***
prop_type_simplifiedOther                                  1.02e-08 ***
prop_type_simplifiedPrivate room in residential home       6.59e-05 ***
number_of_reviews                                          0.000353 ***
review_scores_rating                                       0.010062 *  
room_typeHotel room                                        8.48e-09 ***
room_typePrivate room                                       < 2e-16 ***
room_typeShared room                                       2.63e-06 ***
accommodates                                                < 2e-16 ***
neighbourhood_cleansed_simplifiedFrederiksberg             0.009464 ** 
neighbourhood_cleansed_simplifiedIndre By                   < 2e-16 ***
neighbourhood_cleansed_simplifiedNrrebro                   0.556026    
neighbourhood_cleansed_simplifiedOther                      < 2e-16 ***
neighbourhood_cleansed_simplifiedsterbro                   0.072191 .  
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.000213 ***
availability_30                                             < 2e-16 ***

Residual standard error: 0.3651 on 6444 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.5234,    Adjusted R-squared:  0.5221 
F-statistic: 416.3 on 17 and 6444 DF,  p-value: < 2.2e-16
car::vif(model7)
                                      GVIF Df GVIF^(1/(2*Df))
prop_type_simplified              3.233689  4        1.158010
number_of_reviews                 1.067135  1        1.033022
review_scores_rating              1.017985  1        1.008953
room_type                         2.974696  3        1.199243
accommodates                      1.239515  1        1.113335
neighbourhood_cleansed_simplified 1.188643  6        1.014505
availability_30                   1.047300  1        1.023377
autoplot(model7)

Interpretation:

Compared with our current preferred model (model6), the adjusted R2 increases even further, by roughly 0.08 (from 0.441 to 0.522), and the residual standard error falls as well (from 0.3948 to 0.3651). availability_30 is also a significant explanatory variable for price at the 1% level.

Multicollinearity again does not appear to be an issue, with GVIFs remaining below 5 across the board. The residuals vs fitted and scale-location plots suggest that adding availability_30 has further improved our model with respect to the assumptions on the error terms (less remaining pattern and a more even spread of the residuals). Therefore, model7 is now our preferred model.

7.6 Model 8 - Reviews per month

model8 <- lm(log(price_4_nights) ~ prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating +
               room_type +
               accommodates +
               neighbourhood_cleansed_simplified+
               availability_30+
               reviews_per_month, 
             data=regression_df_clean_neighbourhood
             )

msummary(model8)
                                                             Estimate
(Intercept)                                                 7.5480308
prop_type_simplifiedEntire rental unit                     -0.0018942
prop_type_simplifiedEntire residential home                 0.0838592
prop_type_simplifiedOther                                   0.1187630
prop_type_simplifiedPrivate room in residential home        0.2172859
number_of_reviews                                          -0.0001881
review_scores_rating                                        0.0246266
room_typeHotel room                                        -0.5143128
room_typePrivate room                                      -0.6092725
room_typeShared room                                       -0.6064637
accommodates                                                0.1256042
neighbourhood_cleansed_simplifiedFrederiksberg              0.0541359
neighbourhood_cleansed_simplifiedIndre By                   0.2790454
neighbourhood_cleansed_simplifiedNrrebro                   -0.0087199
neighbourhood_cleansed_simplifiedOther                     -0.1865979
neighbourhood_cleansed_simplifiedsterbro                    0.0382876
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave  0.0725633
availability_30                                             0.0169401
reviews_per_month                                          -0.0211813
                                                           Std. Error t value
(Intercept)                                                 0.0480596 157.056
prop_type_simplifiedEntire rental unit                      0.0141703  -0.134
prop_type_simplifiedEntire residential home                 0.0274710   3.053
prop_type_simplifiedOther                                   0.0221457   5.363
prop_type_simplifiedPrivate room in residential home        0.0549327   3.955
number_of_reviews                                           0.0001287  -1.461
review_scores_rating                                        0.0088314   2.789
room_typeHotel room                                         0.0935747  -5.496
room_typePrivate room                                       0.0236019 -25.815
room_typeShared room                                        0.1303428  -4.653
accommodates                                                0.0029150  43.089
neighbourhood_cleansed_simplifiedFrederiksberg              0.0210336   2.574
neighbourhood_cleansed_simplifiedIndre By                   0.0187427  14.888
neighbourhood_cleansed_simplifiedNrrebro                    0.0189966  -0.459
neighbourhood_cleansed_simplifiedOther                      0.0184467 -10.116
neighbourhood_cleansed_simplifiedsterbro                    0.0209419   1.828
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave  0.0189773   3.824
availability_30                                             0.0005075  33.378
reviews_per_month                                           0.0036839  -5.750
                                                           Pr(>|t|)    
(Intercept)                                                 < 2e-16 ***
prop_type_simplifiedEntire rental unit                     0.893665    
prop_type_simplifiedEntire residential home                0.002278 ** 
prop_type_simplifiedOther                                  8.48e-08 ***
prop_type_simplifiedPrivate room in residential home       7.72e-05 ***
number_of_reviews                                          0.144079    
review_scores_rating                                       0.005310 ** 
room_typeHotel room                                        4.03e-08 ***
room_typePrivate room                                       < 2e-16 ***
room_typeShared room                                       3.34e-06 ***
accommodates                                                < 2e-16 ***
neighbourhood_cleansed_simplifiedFrederiksberg             0.010082 *  
neighbourhood_cleansed_simplifiedIndre By                   < 2e-16 ***
neighbourhood_cleansed_simplifiedNrrebro                   0.646234    
neighbourhood_cleansed_simplifiedOther                      < 2e-16 ***
neighbourhood_cleansed_simplifiedsterbro                   0.067553 .  
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.000133 ***
availability_30                                             < 2e-16 ***
reviews_per_month                                          9.35e-09 ***

Residual standard error: 0.3641 on 6443 degrees of freedom
  (952 observations deleted due to missingness)
Multiple R-squared:  0.5258,    Adjusted R-squared:  0.5245 
F-statistic: 396.9 on 18 and 6443 DF,  p-value: < 2.2e-16
car::vif(model8)
                                      GVIF Df GVIF^(1/(2*Df))
prop_type_simplified              3.310160  4        1.161398
number_of_reviews                 1.200643  1        1.095738
review_scores_rating              1.019337  1        1.009622
room_type                         2.981912  3        1.199727
accommodates                      1.240257  1        1.113668
neighbourhood_cleansed_simplified 1.196300  6        1.015048
availability_30                   1.048762  1        1.024091
reviews_per_month                 1.185969  1        1.089022
autoplot(model8)

Interpretation:

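From the above, we can see that reviews_per_month only marginally improves the explanatory power of our model: the adjusted R2 rises from 0.5221 to 0.5245 and the residual standard error falls from 0.3651 to 0.3641. Note also that number_of_reviews is no longer significant once reviews_per_month is included (p = 0.144). Computing AICs in the following section will help us identify whether model8 should be preferred over model7.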

8 Model Selection

8.1 Summary Table

library(huxtable)
huxreg(model1, model2, model3, model4, model5, model6, model7, model8, 
                 statistics = c('R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'AIC' = 'AIC'), 
                 bold_signif = 0.05
       ) %>% 
  set_caption('Comparison of models')
Comparison of models
                                                            (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
(Intercept)                                                 8.212 ***  8.232 ***  7.778 ***  7.777 ***  7.778 ***  7.767 ***  7.536 ***  7.548 ***
                                                            (0.062)    (0.057)    (0.052)    (0.053)    (0.053)    (0.052)    (0.048)    (0.048)
prop_type_simplifiedEntire rental unit                      0.040 *    0.036 *    0.022      0.022      0.022      0.009      0.010      -0.002
                                                            (0.020)    (0.018)    (0.016)    (0.016)    (0.016)    (0.015)    (0.014)    (0.014)
prop_type_simplifiedEntire residential home                 0.318 ***  0.318 ***  -0.030     -0.030     -0.030     0.100 ***  0.100 ***  0.084 **
                                                            (0.036)    (0.033)    (0.031)    (0.031)    (0.031)    (0.030)    (0.027)    (0.027)
prop_type_simplifiedOther                                   -0.241 *** 0.345 ***  0.175 ***  0.175 ***  0.175 ***  0.138 ***  0.127 ***  0.119 ***
                                                            (0.023)    (0.028)    (0.025)    (0.025)    (0.025)    (0.024)    (0.022)    (0.022)
prop_type_simplifiedPrivate room in residential home        -0.587 *** 0.362 ***  0.149 *    0.149 *    0.149 *    0.251 ***  0.220 ***  0.217 ***
                                                            (0.069)    (0.070)    (0.063)    (0.063)    (0.063)    (0.060)    (0.055)    (0.055)
number_of_reviews                                           -0.000 *   0.000      -0.000     -0.000     -0.000     -0.001 *** -0.000 *** -0.000
                                                            (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)
review_scores_rating                                        -0.005     -0.012     -0.009     -0.009     -0.009     -0.005     0.023 *    0.025 **
                                                            (0.012)    (0.011)    (0.010)    (0.010)    (0.010)    (0.010)    (0.009)    (0.009)
room_typeHotel room                                                    -0.284 *   -0.153     -0.153     -0.153     -0.242 *   -0.540 *** -0.514 ***
                                                                       (0.120)    (0.108)    (0.108)    (0.108)    (0.101)    (0.094)    (0.094)
room_typePrivate room                                                  -0.967 *** -0.638 *** -0.638 *** -0.638 *** -0.582 *** -0.611 *** -0.609 ***
                                                                       (0.029)    (0.027)    (0.027)    (0.027)    (0.026)    (0.024)    (0.024)
room_typeShared room                                                   -0.685 *** -0.497 *** -0.497 *** -0.497 *** -0.396 **  -0.614 *** -0.606 ***
                                                                       (0.168)    (0.151)    (0.151)    (0.151)    (0.141)    (0.131)    (0.130)
accommodates                                                                      0.132 ***  0.132 ***  0.132 ***  0.125 ***  0.126 ***  0.126 ***
                                                                                  (0.003)    (0.003)    (0.003)    (0.003)    (0.003)    (0.003)
host_is_superhostTRUE                                                                        0.003
                                                                                             (0.015)
instant_bookableTRUE                                                                                    -0.000
                                                                                                        (0.014)
neighbourhood_cleansed_simplifiedFrederiksberg                                                                     0.042      0.055 **   0.054 *
                                                                                                                   (0.023)    (0.021)    (0.021)
neighbourhood_cleansed_simplifiedIndre By                                                                          0.297 ***  0.273 ***  0.279 ***
                                                                                                                   (0.020)    (0.019)    (0.019)
neighbourhood_cleansed_simplifiedNrrebro                                                                           -0.040     -0.011     -0.009
                                                                                                                   (0.021)    (0.019)    (0.019)
neighbourhood_cleansed_simplifiedOther                                                                             -0.198 *** -0.190 *** -0.187 ***
                                                                                                                   (0.020)    (0.018)    (0.018)
neighbourhood_cleansed_simplifiedsterbro                                                                           0.042      0.038      0.038
                                                                                                                   (0.023)    (0.021)    (0.021)
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave                                                         0.046 *    0.070 ***  0.073 ***
                                                                                                                   (0.021)    (0.019)    (0.019)
availability_30                                                                                                               0.017 ***  0.017 ***
                                                                                                                              (0.001)    (0.001)
reviews_per_month                                                                                                                        -0.021 ***
                                                                                                                                         (0.004)
R squared                                                   0.070      0.210      0.362      0.362      0.362      0.442      0.523      0.526
Adj. R Squared                                              0.069      0.209      0.361      0.361      0.361      0.441      0.522      0.524
Residual SE                                                 0.509      0.470      0.422      0.422      0.422      0.395      0.365      0.364
AIC                                                         9630.980   8583.554   7204.709   7206.343   7206.708   6347.539   5334.606   5303.533
*** p < 0.001; ** p < 0.01; * p < 0.05.

Interpretation:

As the table above shows, the computed AICs (Akaike Information Criterion values) confirm our interpretations throughout. The AIC rewards goodness of fit but penalises every additional parameter, so it favours the model that explains the most variation with the fewest explanatory variables. The smaller the AIC, the better. The AIC has fallen at every step from one preferred model to the next (model3, model6, model7, model8), whereas model4 and model5 have slightly higher AICs than model3. This identifies model8 as our best overall model.
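For reference, the criterion can be written as follows, where k is the number of estimated parameters and \hat{L} is the maximised likelihood of the model (both symbols are introduced here and do not appear in the output above). The differences between successive preferred models are computed directly from the AIC row of the table:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}

\begin{aligned}
\Delta\mathrm{AIC}_{3 \to 6} &= 7204.709 - 6347.539 = 857.170 \\
\Delta\mathrm{AIC}_{6 \to 7} &= 6347.539 - 5334.606 = 1012.933 \\
\Delta\mathrm{AIC}_{7 \to 8} &= 5334.606 - 5303.533 = 31.073
\end{aligned}
```

The step from model7 to model8 is by far the smallest improvement, which is consistent with the marginal change in adjusted R2, but it is still a reduction.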

8.2 Model prediction

Question: Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.

# Create a new data frame describing the hypothetical listing
staying <- tibble(
  prop_type_simplified = "Entire rental unit",
  number_of_reviews = 10,
  review_scores_rating = 90,
  room_type = "Private room",
  accommodates = 2,
  neighbourhood_cleansed_simplified = "Indre By",
  availability_30 = 5.867413,     # sample mean
  reviews_per_month = 0.9128969   # sample mean
)

# Predict log(price_4_nights) with a 95% prediction interval, then undo the
# log transformation via exp() and convert DKK to USD (0.16 USD per DKK)
model_prediction <- data.frame(predict(model8, newdata = staying, interval = "prediction")) %>%
  mutate(price_4_nights = exp(fit) * 0.16,
         CI_lower = exp(lwr) * 0.16,
         CI_upper = exp(upr) * 0.16) %>%
  select(-fit, -lwr, -upr)

model_prediction
price_4_nights  CI_lower  CI_upper
2.78e+03        538       1.43e+04

Interpretation:

For the prediction with model8, we specified a private room in a rental unit (prop_type_simplified = "Entire rental unit", room_type = "Private room"). We set the number of reviews to 10, the average review rating to 90 and accommodates to 2. We chose a room in the most expensive neighbourhood (Indre By). Also, we set availability_30 and reviews_per_month equal to their respective means.

Accounting for our log transformation of price and a current exchange rate of 0.16 USD per Danish krone, we get an expected price_4_nights of c.2776 USD. This translates into c.694 USD per night. This is a lot, even for a room in the most expensive part of Copenhagen, which is quite an expensive city in itself.
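The point prediction can be reproduced by hand from the model8 coefficients and the values in the staying data frame. The working below is a sketch of that arithmetic, with the reference categories contributing zero and intermediate values rounded:

```latex
\begin{aligned}
\widehat{\log(\text{price\_4\_nights})}
  &= 7.5480 - 0.0019 + 10\,(-0.0001881) + 90\,(0.0246266) - 0.6093 \\
  &\quad + 2\,(0.1256042) + 0.2790 + 5.867413\,(0.0169401) + 0.9128969\,(-0.0211813) \\
  &\approx 9.762 \\[4pt]
\text{price\_4\_nights}
  &\approx e^{9.762} \times 0.16
   \approx 1.74 \times 10^{4}\ \text{DKK} \times 0.16
   \approx 2.78 \times 10^{3}\ \text{USD}
\end{aligned}
```

The rating term alone contributes about 2.22 of the 9.76 log points (90 × 0.0246), so the prediction is very sensitive to the scale on which review_scores_rating is recorded in the data.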

8.3 Additional Steps that might improve our analysis

Our 95% prediction interval for price_4_nights ranges from c.134 USD per night to c.3585 USD per night. This shows that there is still great variability in our model's predictions and hence the model would need to be improved further. Most importantly, however, some prices at both ends are likely to have been recorded incorrectly, which may heavily skew our results. This would require further investigation which, due to time constraints, was beyond the scope of this project.

Secondly, further analysis could test the remaining variables in the dataset to see whether they improve our model’s explanatory power. Also, we have not looked at any interactions between variables. For example, the interaction between room type and neighbourhood would be an interesting one to examine, since it would tell us the effect of each room type in a specific neighbourhood. Moreover, one could also introduce data from outside the current set, e.g. a dummy variable indicating whether a hotel is within close range.
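As a rough illustration of how such an interaction could be tested, the sketch below fits an additive and an interaction model and compares them with a nested-model F test. It runs on simulated stand-in data, not on the Airbnb data set: the data frame toy, its column names, the factor levels and all simulated coefficients are hypothetical and chosen only so the example is self-contained.

```r
# Simulated stand-in data (hypothetical values, NOT the Copenhagen listings)
set.seed(1)
n <- 400
toy <- data.frame(
  room_type     = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE)),
  neighbourhood = factor(sample(c("Indre By", "Other"), n, replace = TRUE)),
  accommodates  = sample(1:6, n, replace = TRUE)
)

# Build a log price in which the private-room discount differs by neighbourhood
private <- toy$room_type == "Private room"
central <- toy$neighbourhood == "Indre By"
toy$log_price <- 7.5 + 0.12 * toy$accommodates - 0.6 * private +
  0.3 * central + 0.2 * private * central + rnorm(n, sd = 0.35)

# Additive model: the room-type effect is the same in every neighbourhood
additive <- lm(log_price ~ accommodates + room_type + neighbourhood, data = toy)

# Interaction model: room_type * neighbourhood expands to both main effects
# plus their interaction term
with_interaction <- lm(log_price ~ accommodates + room_type * neighbourhood, data = toy)

# F test of the nested models: does the interaction add explanatory power?
comparison <- anova(additive, with_interaction)
comparison
```

On the real data the same comparison would use room_type * neighbourhood_cleansed_simplified in place of the two separate terms. With seven neighbourhood groups and four room types this adds many coefficients, some of which may rest on very few listings.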

Thirdly, this is only a very simple regression model of price. One could investigate whether estimators other than OLS would make more sense for our specific scenario. Also, if we had time series data on Airbnb prices within Copenhagen, one could account for things like seasonality, which would allow us to determine the best time of the year to travel to Copenhagen.

9 Acknowledgements